class: center, middle, inverse, title-slide

.title[
# Lecture 22
]
.subtitle[
## Multiple Linear Regression With Categorical Variables
]
.author[
### Psych 10 C
]
.institute[
### University of California, Irvine
]
.date[
### 05/20/2022
]

---

## Review

- Last class we started working with an example using multiple linear regression.

--

- We wanted to know if *age* and *height* are good predictors of the blood pressure of a participant.

--

- We started by comparing 4 models:

--

1. **Null model:** Blood pressure is constant *regardless* of age and height.

--

1. **Age model:** The expected blood pressure of participants changes *as a linear function of their age*.

--

1. **Height model:** The expected blood pressure of participants changes *as a linear function of their height*.

--

1. **Age + Height model:** The expected blood pressure of participants changes *as a linear function of their age and height*.

---

## Results from the model comparison

- We compared these 4 models using the following table:

| Model | Parameters | MSE | `\(R^2\)` | BIC |
|-------|:----------:|:---:|:-----:|:---:|
| Null | 1 | 157.76 | | 256.97|
| Age | 2 | 93.46 | 0.41 | 234.7|
| Height | 2 | 120.28 | 0.24 | 247.31|
| Age + Height | 3 | 47.75 | 0.7 | 205.04|

--

- The results indicate that, considering only the continuous variables in our data, the best model is the one that assumes that *both age and height have an effect on the blood pressure of participants*.

--

- However, in our analysis we ignored one of our variables included in the complete dataset: the *sex of participants at birth*.
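---

## Checking the table by hand

The `\(R^2\)` and BIC columns of the table can be recomputed from the MSE column alone. The sketch below assumes `n_total = 50`: the sample size is not stated on these slides, but it is the value that reproduces the BIC column. The vector names `mse` and `k` are ours, and the BIC formula is the one used in the course code, `n * log(MSE) + k * log(n)`.

```r
# Assumption: n_total = 50 is inferred from the BIC column, not given on the slides
n_total <- 50

# MSE and number of parameters for each model, copied from the table
mse <- c(null = 157.76, age = 93.46, height = 120.28, age_height = 47.75)
k   <- c(null = 1, age = 2, height = 2, age_height = 3)

# R^2 = (SSE_null - SSE) / SSE_null, which is the same as 1 - MSE / MSE_null
r2 <- 1 - mse / mse[["null"]]

# BIC = n * log(MSE) + k * log(n)
bic <- n_total * log(mse) + k * log(n_total)

# Put everything side by side, rounded like the table
round(cbind(parameters = k, mse = mse, r2 = r2, bic = bic), 2)
```

Because the MSE values in the table are already rounded, the recomputed values can differ from the table in the last decimal.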
---

## Sex at birth and blood pressure

.pull-left[
<img src="data:image/png;base64,#lec-22_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#lec-22_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

--

- From the first graph we can notice two things. First, blood pressure levels seem to be different between the two groups. Second, the relation between height and blood pressure within each group is not clear.

--

- From the second graph, there also seems to be a difference between the two groups. However, the effect of age seems to be approximately the same in both.

---

## Adding categorical variables

- Since there seems to be a difference between the two groups, we decide to add this new categorical variable to our linear regression model.

--

- In order to do this, we first assign "arbitrary" numeric values to each of the labels. We do this by using an indicator function:

`$$z_i = \begin{cases} 0 & \quad \text{if observation } i \text{ is male}\\ 1 & \quad \text{if observation } i \text{ is female} \end{cases}$$`

--

- According to this definition of the variable, the Male group is the reference group. The parameter ( `\(\beta_k\)` ) associated with the variable `\(z_i\)` can now be interpreted as the difference in expected blood pressure between members of the Female group and members of the Male group.

--

`$$y_i \sim \text{Normal}(\beta_0 + \dots + \beta_k z_i,\sigma^2)$$`

--

So, for people in the Male group and in the Female group, respectively, we would have:

`$$y_i \sim \text{Normal}(\beta_0 + \dots + \beta_k(0),\sigma^2)$$`

`$$y_i \sim \text{Normal}(\beta_0 + \dots + \beta_k(1),\sigma^2)$$`

---

## Linear models with categorical variables

- For now we will only consider the additive models that take into account our categorical variable as a predictor of blood pressure.

--

1. **Sex model**: *Only the sex at birth* has an effect on blood pressure.
`$$y_i \sim \text{Normal}(\beta_0 + \beta_3 z_i,\sigma_5^2)$$`

--

1. **Sex + Age model**: The expected blood pressure of participants is *a linear function of their current age and sex at birth*.

`$$y_i \sim \text{Normal}(\beta_0 + \beta_1\text{age}_i + \beta_3 z_i,\sigma_6^2)$$`

--

1. **Sex + Height model**: The expected blood pressure of participants is *a linear function of their current height and sex at birth*.

`$$y_i \sim \text{Normal}(\beta_0 + \beta_2\text{height}_i + \beta_3 z_i,\sigma_7^2)$$`

--

1. **Sex + Age + Height model**: The expected blood pressure of participants is *a linear function of their age, height and sex at birth*.

`$$y_i \sim \text{Normal}(\beta_0 + \beta_1\text{age}_i + \beta_2\text{height}_i + \beta_3 z_i,\sigma_8^2)$$`

---

## Adding an indicator variable to our data

- To perform the usual model comparison, we need to add the predictions and errors of each of these new models to our data, following the same steps as before. The only change is that we now need to start by adding the new indicator variable.

--

- We can add our indicator variable `\(z_i\)` using the `mutate()` and `case_when()` functions:

```r
# mutate(), case_when() and %>% come from dplyr (loaded with the tidyverse)
pressure <- pressure %>% 
  mutate("sex_id" = case_when(sex_at_birth == "male" ~ 0,
                              sex_at_birth == "female" ~ 1))
```

--

- This adds a new variable "sex_id" to our dataset that takes a value of 1 when the "sex_at_birth" column has a "female" label, and a value of 0 when the label is "male".

---

## Adding predictions and errors

- We will start with the predictions of the model that assumes that only sex at birth is a good predictor of blood pressure.
--

```r
# Estimate the parameters of the model via lm
betas_sex <- lm(formula = blood_pressure ~ sex_id, data = pressure)$coef

# Add predictions and errors
pressure <- pressure %>% 
  mutate("prediction_sex" = betas_sex[1] + betas_sex[2] * sex_id,
         "error_sex" = (blood_pressure - prediction_sex)^2)

# Calculate SSE, MSE, R^2 and BIC
sse_sex <- sum(pressure$error_sex)
mse_sex <- 1/n_total * sse_sex
r2_sex <- (sse_null - sse_sex) / sse_null
bic_sex <- n_total * log(mse_sex) + 2 * log(n_total)
```

---

## Graph of the model's predictions

<img src="data:image/png;base64,#lec-22_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

--

- As we can see, the predictions of this model are just the average blood pressure observed in each of the sex groups (Males/Females).

---

## Age + Sex model

- The Age + Sex model is a multiple linear regression model. However, this time we will be able to visualize the predictions of the model, because one of its independent variables is categorical.

```r
# Estimate the parameters of the model via lm
betas_as <- lm(formula = blood_pressure ~ age + sex_id, data = pressure)$coef

# Add predictions and errors
pressure <- pressure %>% 
  mutate("prediction_as" = betas_as[1] + betas_as[2] * age + betas_as[3] * sex_id,
         "error_as" = (blood_pressure - prediction_as)^2)

# Calculate SSE, MSE, R^2 and BIC
sse_as <- sum(pressure$error_as)
mse_as <- 1/n_total * sse_as
r2_as <- (sse_null - sse_as) / sse_null
bic_as <- n_total * log(mse_as) + 3 * log(n_total)
```

---

## Predictions for the Age + Sex model

<img src="data:image/png;base64,#lec-22_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

- The predictions are different for each group, as indicated by the two lines. However, **the lines are parallel**. This is because this model does not assume an interaction between the variables Age and Sex at birth.
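---

## Why the lines are parallel

- A short derivation may help here. Substituting the two possible values of the indicator `\(z_i\)` into the expected value of the Sex + Age model, with the same symbols used in that model, gives one line per group:

`$$\text{Male } (z_i = 0): \quad E(y_i) = \beta_0 + \beta_1\text{age}_i$$`

`$$\text{Female } (z_i = 1): \quad E(y_i) = (\beta_0 + \beta_3) + \beta_1\text{age}_i$$`

- Both lines share the same slope `\(\beta_1\)`. They differ only in their intercepts, and the vertical distance between them is `\(\beta_3\)` at every age.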
---

## Height + Sex model

- The Height + Sex model is another multiple linear regression model whose predictions we will be able to visualize in a graph.

```r
# Estimate the parameters of the model via lm
betas_hs <- lm(formula = blood_pressure ~ height + sex_id, data = pressure)$coef

# Add predictions and errors
pressure <- pressure %>% 
  mutate("prediction_hs" = betas_hs[1] + betas_hs[2] * height + betas_hs[3] * sex_id,
         "error_hs" = (blood_pressure - prediction_hs)^2)

# Calculate SSE, MSE, R^2 and BIC
sse_hs <- sum(pressure$error_hs)
mse_hs <- 1/n_total * sse_hs
r2_hs <- (sse_null - sse_hs) / sse_null
bic_hs <- n_total * log(mse_hs) + 3 * log(n_total)
```

---

## Predictions for the Height + Sex model

<img src="data:image/png;base64,#lec-22_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

--

- This model suggests that as the height of the participant increases, blood pressure decreases. This prediction is different from the one made by the model that only considered height as a predictor. `\((\hat{\beta}_2 = -0.2)\)`

---

## Age + Height + Sex model

- For this fourth model we will not be able to visualize the predictions, because it contains two continuous variables.

```r
# Estimate the parameters of the model via lm
betas_ahs <- lm(formula = blood_pressure ~ age + height + sex_id, data = pressure)$coef

# Add predictions and errors
pressure <- pressure %>% 
  mutate("prediction_ahs" = betas_ahs[1] + betas_ahs[2] * age + 
           betas_ahs[3] * height + betas_ahs[4] * sex_id,
         "error_ahs" = (blood_pressure - prediction_ahs)^2)

# Calculate SSE, MSE, R^2 and BIC
sse_ahs <- sum(pressure$error_ahs)
mse_ahs <- 1/n_total * sse_ahs
r2_ahs <- (sse_null - sse_ahs) / sse_null
bic_ahs <- n_total * log(mse_ahs) + 4 * log(n_total)
```

---

## Comparing all 8 models

- We can add the results for these four new models to the table we had before. We first present the models that have only 1 predictor, then the ones that have 2, and finally the one model with all 3 predictors.

| Model | Parameters | MSE | `\(R^2\)` | BIC |
|-------|:----------:|:---:|:-----:|:---:|
| Null | 1 | 157.76 | | 256.97|
| Age | 2 | 93.46 | 0.41 | 234.7|
| Height | 2 | 120.28 | 0.24 | 247.31|
| **SEX** | **2** | **89.54** | **0.43** | **232.56**|
| Age + Height | 3 | 47.75 | 0.7 | 205.04|
| **Age + SEX** | **3** | **27.04** | **0.83** | **176.6**|
| **Height + SEX** | **3** | **89.25** | **0.43** | **236.31**|
| **Age + Height + SEX** | **4** | **26.7** | **0.83** | **179.88**|

---

## Model comparison

- Previously, when we only took into account the continuous variables in the study, we found that the best model was the one that assumed that both the age and height of participants affected their expected blood pressure.

--

- Now, when we take into account the sex at birth of participants, we end up with two models that are better:

--

  - First, the model including all 3 predictors was better than the model with only height and age.

--

  - However, the best model seems to be the one that assumes that only age and sex at birth have an effect on the expected blood pressure.

--

- As we saw in the graph, once we take into account sex at birth, the parameter associated with height was found to be very close to 0.

--

- This is because when we used only height as a predictor, its effect was **confounded** with the effect of the participant's sex at birth. This means that we could not differentiate between an effect of height alone and an effect of a hidden variable (sex at birth) that was highly correlated with height.

---

## Confounders

- This is not an uncommon case. Often there are variables that seem to be relevant, but their effect is actually due to their association with other unexplored variables.

--

- In this example, height was a good predictor of blood pressure only because it carried some information about the sex at birth of the participants.
--

- When we include the information about the sex at birth of a participant in the model, the association between height and blood pressure almost disappears. This is because the information that height carried is now redundant.

--

- Now that we know that the best model is the one that assumes that blood pressure is a linear function of age and sex at birth, we can interpret the values of the parameters that we found.

---

## Interpretation of the parameters

- The model that we selected (age + sex) has 3 parameters: the intercept `\(\beta_0\)`, the slope associated with age `\(\beta_1\)` and the slope associated with sex at birth `\(\beta_3\)`.

--

- For the parameter `\(\beta_0\)` we can say: The estimated value of the intercept was 100.2. This means that the average blood pressure of a **male** who is 0 years old (a newborn) is approximately 100.2.

--

- For the slope associated with age: The estimated value of the slope associated with age was 0.55. This means that, on average, the blood pressure of participants increases by 0.55 per year of age.

--

- Last, for the slope associated with sex at birth we say: The estimated value of the slope associated with sex at birth was -16.3. This indicates that the blood pressure of **female** participants in the study was approximately 16.3 lower than that of males, regardless of their age.

---

- Homework link:

```r
link <- "https://raw.githubusercontent.com/ManuelVU/psych-10c-data/main/homework5.csv"
```
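---

## Worked example: predictions from the Age + Sex model

To make the interpretation of the three parameters concrete, the sketch below plugs the rounded estimates reported above (100.2, 0.55 and -16.3) into the Age + Sex model. The helper `predict_bp()` and the age of 40 are hypothetical and chosen only for illustration; they are not part of the lecture code.

```r
# Rounded estimates reported on the interpretation slide
beta_0 <- 100.2   # intercept: expected blood pressure of a male at age 0
beta_1 <- 0.55    # change in expected blood pressure per year of age
beta_3 <- -16.3   # difference between females and males of the same age

# Hypothetical helper: expected blood pressure under the Age + Sex model
# sex_id follows the indicator coding used above (0 = male, 1 = female)
predict_bp <- function(age, sex_id) {
  beta_0 + beta_1 * age + beta_3 * sex_id
}

predict_bp(age = 40, sex_id = 0)  # 40-year-old male
predict_bp(age = 40, sex_id = 1)  # 40-year-old female

# At any given age, the female minus male difference equals beta_3
predict_bp(age = 40, sex_id = 1) - predict_bp(age = 40, sex_id = 0)
```

With these rounded values the model predicts about 122.2 for a 40-year-old male and about 105.9 for a 40-year-old female. The difference of 16.3 is the same at every age because the model has no interaction between age and sex at birth.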